Code
library(tidyverse)
library(ggplot2)
library(readr)
library(corrplot)
library(reshape2)
library(corrgram)
library(ggpubr)
library(caTools)
library(GGally)
library(stats)
library(ggsci)
Niharika Pola
December 13, 2022
What is takes for a country or a continent to be happy or sad? Is it the higher GDP per capita, lower Population growth, lower Suicide rate or Urbanization? What are the factors that affect a country’s or continents overall happiness? Can we predict the happiness score? The curiosity to find answers to these questions made me explore the world happiness scores and world data by country.
The World Happiness Report is a landmark survey of the state of global happiness . The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields – economics, psychology, survey analysis, national statistics, health, public policy and more – describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness.
The data sets that are used in this analysis are 1) world happiness report 2021 and 2) kaggle’s world data by country 2020. Happiness scores from first data set and few variables from the second data set are being used to analyse the following research question.
I wish to test the following hypothesis,
This study aims to test the above hypothesis while comparing the GDP per capital and Urbanization with other variables including: Population growth, Life Expectancy, Sex-Ratio and Suicide Rate.
Rows: 149 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country name, Regional indicator
dbl (18): Ladder score, Standard error of ladder score, upperwhisker, lowerw...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1] "Country" "Happiness"
Rows: 191 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, ISO-code
dbl (1): GDP per capita
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 218 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, ISO-code
dbl (1): Urbanization rate
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 207 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, ISO-code
dbl (1): Population growth
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 185 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, ISO-code
dbl (1): Life expectancy
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 182 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, ISO-code
dbl (1): Suicide rate
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 226 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, ISO-code
dbl (1): Sex-ratio
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#merging the above datasets by country and iso code
world_data <- merge(GDP_per_capita, Urbanization_rate)
world_data <- merge(world_data, Population_growth)
world_data <- merge(world_data, Life_expectancy)
world_data <- merge(world_data, Suicide_rate)
world_data <- merge(world_data, Sex_ratio)
#merging happiness scores and world data
happy <- merge(world_data, hap_sc)
happy <- rename(happy, "GDP.per.capita"="GDP per capita", "urbanization.rate"="Urbanization rate", "Population.growth"="Population growth", "Life.expectancy"="Life expectancy", "Suicide.rate"="Suicide rate", "Sex.ratio"="Sex-ratio")
str(happy)
'data.frame': 199 obs. of 9 variables:
$ Country : chr "Afghanistan" "Algeria" "Argentina" "Armenia" ...
$ ISO-code : chr "AFG" "DZA" "ARG" "ARM" ...
$ GDP.per.capita : num 2182 16091 19971 11845 54799 ...
$ urbanization.rate: num 26 73.7 92.1 63.3 86.2 58.7 56.4 89.5 38.2 79.5 ...
$ Population.growth: num 2.41 1.89 0.88 0.17 1.6 0.46 1.35 1.92 1.19 -0.1 ...
$ Life.expectancy : num 64.5 76.7 76.5 74.9 83.3 81.4 72.9 77.2 72.3 74.6 ...
$ Suicide.rate : num 6.4 3.3 9.1 5.7 11.7 11.4 2.6 5.7 6.1 21.4 ...
$ Sex.ratio : num 1.03 1.03 0.98 0.95 0.99 0.96 0.98 1.53 0.97 0.87 ...
$ Happiness : num 2.52 4.89 5.93 5.28 7.18 ...
#adding continents data to final dataset
happy$Continent <- NA
happy$Continent[which(happy$Country %in% c("Israel", "United Arab Emirates", "Singapore", "Thailand", "Taiwan Province of China",
"Qatar", "Saudi Arabia", "Kuwait", "Bahrain", "Malaysia", "Uzbekistan", "Japan",
"South Korea", "Turkmenistan", "Kazakhstan", "Turkey", "Hong Kong S.A.R., China", "Philippines",
"Jordan", "China", "Pakistan", "Indonesia", "Azerbaijan", "Lebanon", "Vietnam",
"Tajikistan", "Bhutan", "Kyrgyzstan", "Nepal", "Mongolia", "Palestinian Territories",
"Iran", "Bangladesh", "Myanmar", "Iraq", "Sri Lanka", "Armenia", "India", "Georgia",
"Cambodia", "Afghanistan", "Yemen", "Syria"))] <- "Asia"
happy$Continent[which(happy$Country %in% c("Norway", "Denmark", "Iceland", "Switzerland", "Finland",
"Netherlands", "Sweden", "Austria", "Ireland", "Germany",
"Belgium", "Luxembourg", "United Kingdom", "Czech Republic",
"Malta", "France", "Spain", "Slovakia", "Poland", "Italy",
"Russia", "Lithuania", "Latvia", "Moldova", "Romania",
"Slovenia", "North Cyprus", "Cyprus", "Estonia", "Belarus",
"Serbia", "Hungary", "Croatia", "Kosovo", "Montenegro",
"Greece", "Portugal", "Bosnia and Herzegovina", "Macedonia",
"Bulgaria", "Albania", "Ukraine"))] <- "Europe"
happy$Continent[which(happy$Country %in% c("Canada", "Costa Rica", "United States", "Mexico",
"Panama","Trinidad and Tobago", "El Salvador", "Belize", "Guatemala",
"Jamaica", "Nicaragua", "Dominican Republic", "Honduras",
"Haiti"))] <- "North America"
happy$Continent[which(happy$Country %in% c("Chile", "Brazil", "Argentina", "Uruguay",
"Colombia", "Ecuador", "Bolivia", "Peru",
"Paraguay", "Venezuela"))] <- "South America"
happy$Continent[which(happy$Country %in% c("New Zealand", "Australia"))] <- "Australia"
happy$Continent[which(is.na(happy$Continent))] <- "Africa"
# Moving the continent column's position in the dataset to the second column
happy <- happy %>% select(Country,Continent, everything())
happy <- happy[!duplicated(happy), ]
#Renaming the final dataframe to happy
view(happy)
happy <- happy[-(48:110), ]
str(happy)
'data.frame': 136 obs. of 10 variables:
$ Country : chr "Afghanistan" "Algeria" "Argentina" "Armenia" ...
$ Continent : chr "Asia" "Africa" "South America" "Asia" ...
$ ISO-code : chr "AFG" "DZA" "ARG" "ARM" ...
$ GDP.per.capita : num 2182 16091 19971 11845 54799 ...
$ urbanization.rate: num 26 73.7 92.1 63.3 86.2 58.7 56.4 89.5 38.2 79.5 ...
$ Population.growth: num 2.41 1.89 0.88 0.17 1.6 0.46 1.35 1.92 1.19 -0.1 ...
$ Life.expectancy : num 64.5 76.7 76.5 74.9 83.3 81.4 72.9 77.2 72.3 74.6 ...
$ Suicide.rate : num 6.4 3.3 9.1 5.7 11.7 11.4 2.6 5.7 6.1 21.4 ...
$ Sex.ratio : num 1.03 1.03 0.98 0.95 0.99 0.96 0.98 1.53 0.97 0.87 ...
$ Happiness : num 2.52 4.89 5.93 5.28 7.18 ...
The final data set “happy” consists of 12 variables and 136 observations. We can start the analysis now.
Warning in ggcorr(happy, label = TRUE, label_round = 2, label_size = 3.5, : data
in column(s) 'Country', 'Continent', 'ISO-code' are not numeric and were ignored
Observations from the correlation plot:
The happiness score is,
Strongly correlating with GDP per capita, Urbanization rate and Life expectancy.
Inversely correlated with suicidal rate and population growth.
No correlation with sex ratio.
Attaching package: 'maps'
The following object is masked from 'package:purrr':
map
world.data <- map_data("world")
map.world_joined <- left_join(world.data, plot_2, by = c('region'='Country'))
map1 <- ggplot(map.world_joined, aes(x=long, y=lat, group=group)) +
geom_polygon(aes(fill=Happiness), color="black")
map2 <- map1 + scale_fill_gradient(name = "Happiness Score", low = "yellow", high = "red", na.value = "grey50") +
theme(axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
rect = element_blank())
map2
North America, Europe and Australia appear to be the happiest.
South America, North Asia, Western Africa and countries/islands in the indian ocean appear to be happy on average.
South Africa, India are unhappy and of course Afghanistan is the unhappiest country in the world.
####### Happiness score for each continent
gg1 <- ggplot(happy,
aes(x=Continent,
y=Happiness,
color=Continent))+
geom_point() + theme_bw() +
theme(axis.title = element_text(family = "Helvetica", size = (8)))
gg2 <- ggplot(happy , aes(x = Continent, y = Happiness)) +
geom_boxplot(aes(fill=Continent)) + theme_bw() +
theme(axis.title = element_text(family = "Helvetica", size = (8)))
gg3 <- ggplot(happy,aes(x=Continent,y=Happiness))+
geom_violin(aes(fill=Continent),alpha=0.7)+ theme_bw() +
theme(axis.title = element_text(family = "Helvetica", size = (8)))
# Compute descriptive statistics by groups
stable <- desc_statby(happy, measure.var = "Happiness",
grps = "Continent")
stable <- stable[, c("Continent","mean","median")]
names(stable) <- c("Continent", "Mean of happiness score","Median of happiness score")
# Summary table plot
stable.p <- ggtexttable(stable,rows = NULL,
theme = ttheme("classic"))
ggarrange(gg1, gg2, ncol = 1, nrow = 2)
As we have seen before, Australia has the highest median happiness score. Europe, South America, and North America are in the second place regarding median happiness score. Asia has the lowest median after Africa. We can see the range of happiness score for different continents, and also the concentration of happiness score.
Let’s see the correlation between happiness score and 3 major factors in the happiness data set that are GDP per capita, Life Expectancy, Urbanization rate to understand how they are varying.
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = GDP.per.capita, y = Happiness)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line Happiness vs GDP per capita")
`geom_smooth()` using formula = 'y ~ x'
The correlation between GDP per capita and happiness score in Europe, North America, and Asia is more significant than the other continents. Worth mentioning that we will not take Australia into account because there are just two countries in Australia and creating scatter plot with the regression line for this continent will not give us any insight.
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = urbanization.rate, y = Happiness)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line Happiness vs Urbanization rate")
`geom_smooth()` using formula = 'y ~ x'
The relationship between happiness score and urbanization rate is very significant in Asia, Africa and Europe followed by North america and south america.
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = Life.expectancy, y = Happiness)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line Happiness vs Life Expectancy")
`geom_smooth()` using formula = 'y ~ x'
The relationship between happiness score and life expectancy is very significant in Asia, Europe and Africa followed by North america and south america.
From the above visualizations and analysis, we have found that GDP per capita, Urbanization and Life Expectancy play major role in the happiness score. Whereas suicide rate and population growth has an inverse relationship. Let’s build a model for estimating happiness scores based on these control variables and see which of these has effect on the overall happiness scores.
Call:
lm(formula = Happiness ~ GDP.per.capita + urbanization.rate +
Life.expectancy, data = happy)
Residuals:
Min 1Q Median 3Q Max
-1.89896 -0.37351 0.07498 0.48488 1.15628
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.510e-01 7.280e-01 0.620 0.5366
GDP.per.capita 1.790e-05 3.867e-06 4.628 8.72e-06 ***
urbanization.rate 6.992e-03 3.646e-03 1.918 0.0573 .
Life.expectancy 5.814e-02 1.156e-02 5.030 1.57e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6355 on 132 degrees of freedom
Multiple R-squared: 0.6693, Adjusted R-squared: 0.6618
F-statistic: 89.07 on 3 and 132 DF, p-value: < 2.2e-16
It can be seen that p-value of the F-statistic is < 2.2e-16, which is highly significant. This means that, at least, one of the predictor/control variables ( GDP per capita, Urbanization rare and Life Expectancy) is significantly related to the outcome variable (Happiness score).
It can be seen that, changing in GDP per capita and Life expectancy are significantly associated to changes in happiness scores while changes in urbanization rate is not significantly associated with sales.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.510470e-01 7.280370e-01 0.6195386 5.366295e-01
GDP.per.capita 1.789683e-05 3.867139e-06 4.6279250 8.717287e-06
urbanization.rate 6.992347e-03 3.645755e-03 1.9179423 5.727811e-02
Life.expectancy 5.814042e-02 1.155817e-02 5.0302459 1.567648e-06
2.5 % 97.5 %
(Intercept) -9.890821e-01 1.891176e+00
GDP.per.capita 1.024724e-05 2.554641e-05
urbanization.rate -2.193158e-04 1.420401e-02
Life.expectancy 3.527723e-02 8.100362e-02
Lets see how this model behaves if we apply log on the variables
Call:
lm(formula = Happiness ~ log(GDP.per.capita + urbanization.rate +
Life.expectancy), data = happy)
Residuals:
Min 1Q Median 3Q Max
-2.32889 -0.48506 0.05291 0.51078 1.36460
Coefficients:
Estimate Std. Error
(Intercept) -1.66107 0.49628
log(GDP.per.capita + urbanization.rate + Life.expectancy) 0.75478 0.05177
t value Pr(>|t|)
(Intercept) -3.347 0.00106 **
log(GDP.per.capita + urbanization.rate + Life.expectancy) 14.581 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.682 on 134 degrees of freedom
Multiple R-squared: 0.6134, Adjusted R-squared: 0.6105
F-statistic: 212.6 on 1 and 134 DF, p-value: < 2.2e-16
The R-Squared value drops from 0.6693 to 0.6134 by applying log for model-1.
Call:
lm(formula = Happiness ~ GDP.per.capita + Life.expectancy, data = happy)
Residuals:
Min 1Q Median 3Q Max
-2.01976 -0.39026 0.09436 0.49252 1.18774
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.683e-01 7.201e-01 0.234 0.816
GDP.per.capita 2.028e-05 3.699e-06 5.481 2.05e-07 ***
Life.expectancy 6.713e-02 1.067e-02 6.292 4.20e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6418 on 133 degrees of freedom
Multiple R-squared: 0.6601, Adjusted R-squared: 0.655
F-statistic: 129.2 on 2 and 133 DF, p-value: < 2.2e-16
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.683136e-01 7.200997e-01 0.2337366 8.155489e-01
GDP.per.capita 2.027701e-05 3.699299e-06 5.4813121 2.046209e-07
Life.expectancy 6.713497e-02 1.066984e-02 6.2920307 4.201007e-09
It turns out Life Expectancy is strongly associated with happiness scores and GDP per capita follows life expectancy.
2.5 % 97.5 %
(Intercept) -1.256016e+00 1.592643e+00
GDP.per.capita 1.295994e-05 2.759408e-05
Life.expectancy 4.603044e-02 8.823951e-02
However the R-squared values of Model-1 and Model-2 are approximately around 65-66%. Hence we cannot say which model is best fit to predict the happiness scores.
Lets consider all the 5 control variables and see the results.
Lets see how model-2 behaves if we apply log on the variables
Call:
lm(formula = Happiness ~ log(GDP.per.capita + Life.expectancy),
data = happy)
Residuals:
Min 1Q Median 3Q Max
-2.32936 -0.47756 0.04899 0.51212 1.36194
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.62336 0.49427 -3.284 0.0013 **
log(GDP.per.capita + Life.expectancy) 0.75129 0.05158 14.565 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6824 on 134 degrees of freedom
Multiple R-squared: 0.6129, Adjusted R-squared: 0.61
F-statistic: 212.1 on 1 and 134 DF, p-value: < 2.2e-16
The R-Squared value drops from 0.66 to 0.61 by applying log for model-2.
Call:
lm(formula = Happiness ~ GDP.per.capita + urbanization.rate +
Life.expectancy + Suicide.rate + Population.growth, data = happy)
Residuals:
Min 1Q Median 3Q Max
-1.87156 -0.39252 0.05248 0.45831 1.25675
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.712e-02 1.268e+00 -0.021 0.982977
GDP.per.capita 1.646e-05 4.281e-06 3.845 0.000188 ***
urbanization.rate 6.826e-03 3.647e-03 1.872 0.063456 .
Life.expectancy 6.341e-02 1.667e-02 3.804 0.000218 ***
Suicide.rate 1.586e-02 1.317e-02 1.204 0.230615
Population.growth -1.622e-02 7.576e-02 -0.214 0.830844
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6348 on 130 degrees of freedom
Multiple R-squared: 0.6751, Adjusted R-squared: 0.6626
F-statistic: 54.01 on 5 and 130 DF, p-value: < 2.2e-16
The R-squared remains same after adding additional two control variables. So far, the second model fits the happiness scores.
Lets see how model-3 behaves if we apply log on the control variables
Call:
lm(formula = Happiness ~ log(GDP.per.capita + urbanization.rate +
Life.expectancy + Suicide.rate + Population.growth), data = happy)
Residuals:
Min 1Q Median 3Q Max
-2.32855 -0.48525 0.05338 0.51003 1.36226
Coefficients:
Estimate
(Intercept) -1.67866
log(GDP.per.capita + urbanization.rate + Life.expectancy + Suicide.rate + Population.growth) 0.75648
Std. Error
(Intercept) 0.49727
log(GDP.per.capita + urbanization.rate + Life.expectancy + Suicide.rate + Population.growth) 0.05186
t value
(Intercept) -3.376
log(GDP.per.capita + urbanization.rate + Life.expectancy + Suicide.rate + Population.growth) 14.587
Pr(>|t|)
(Intercept) 0.000963
log(GDP.per.capita + urbanization.rate + Life.expectancy + Suicide.rate + Population.growth) < 2e-16
(Intercept) ***
log(GDP.per.capita + urbanization.rate + Life.expectancy + Suicide.rate + Population.growth) ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6818 on 134 degrees of freedom
Multiple R-squared: 0.6136, Adjusted R-squared: 0.6107
F-statistic: 212.8 on 1 and 134 DF, p-value: < 2.2e-16
The R-squared values drops from 0.66 to 0.61 by application of log.
Models | Multiple R-squared | Adjusted R-squared | Significant Variables |
---|---|---|---|
Model-1 (GDP per capita+Urbanization rate+Life expectancy) | 0.6693 | 0.6618 | GDP per capita & Life Expectancy |
Model-1 (Log(GDP per capita+Urbanization rate+Life expectancy)) | 0.6134 | 0.6105 | - |
Model-2 (GDP per capita+Life expectancy) | 0.6601 | 0.655 | GDP per capita & Life Expectancy. Life expectancy > GDP per capita |
Model-2 (Log(GDP per capita+Life expectancy)) | 0.6129 | 0.61 | - |
Model-3 (GDP per capita+Urbanization rate +Life expectancy + Suicide rate + population growth) | 0.6601 | 0.655 | GDP per capita & Life Expectancy |
Model-3 (Log(GDP per capita+Urbanization rate +Life expectancy + Suicide rate + population growth)) | 0.6136 | 0.6107 | GDP per capita & Life Expectancy. Life expectancy > GDP per capita |
It can be derived from the above table that the adjusted R-Squared for Model-2 is better than other two models, hence it can be said that Model-2 is a good fit.
Application of Logarithm on the control variables reduced the R-squared values in all the 3 models.
Hypothesis testing conclusion:
Turns out we have evidence for all the 3 hypothesis but,
------
As we have concluded that Model-2 is a goof fit, Lets divide the happy dataset into training and test and plot the predicted happiness scores for Model-2.
Call:
lm(formula = Happiness ~ GDP.per.capita + Life.expectancy, data = training_set)
Residuals:
Min 1Q Median 3Q Max
-1.9617 -0.4055 0.0248 0.4922 1.1547
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.041e-02 7.946e-01 -0.076 0.94
GDP.per.capita 2.117e-05 4.330e-06 4.889 3.65e-06 ***
Life.expectancy 6.975e-02 1.184e-02 5.892 4.65e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6397 on 105 degrees of freedom
Multiple R-squared: 0.6834, Adjusted R-squared: 0.6774
F-statistic: 113.3 on 2 and 105 DF, p-value: < 2.2e-16
y_pred_lm = predict(regressor_lm, newdata = test_set)
Pred_Actual_lm <- as.data.frame(cbind(Prediction = y_pred_lm, Actual = test_set$Happiness))
gg.lm <- ggplot(Pred_Actual_lm, aes(Actual, Prediction )) +
geom_point() + theme_bw() + geom_abline() +
labs(title = "Multiple Linear Regression", x = "Actual happiness score",
y = "Predicted happiness score") +
theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)),
axis.title = element_text(family = "Helvetica", size = (10)))
gg.lm
The Model shows 67% accuracy on the test data.
Model-1 Diagnostics:
Residuals vs Fitted: The Linearity assumption is met and there is a constant variance of residuals along the line. Overall the plot looks good.
Normal Q-Q: The points are falling along the theoretical quantiles, hence the normality assumption is met.
Scale-Location: The line is approximately horizontal, so I can say that the constant variance assumption is met.
Cook’s distance: No observation is greater than 1. The maximum cook’s distance is <0.3. 174th observation is the influential.
Residuals vs Leverage: No observations are beyond the cook’s distance and the leverage is low.
But the influential observation 174 is not at the same place in all graphs which is odd.
Model-2 Diagnostics:
Residuals vs Fitted: I would say the linearity assumption is violated here but the constant variance of residuals is present.
Normal Q-Q: The points are falling along the theoretical quantiles, hence the normality assumption is met.
Scale-Location: Compared to model-1, the line is almost horizontal. This one looks good.
Cook’s distance: No observation is greater than 1. The maximum cook’s distance is 0.9 (approximately). The 156th observation is very influential.
Residuals vs Leverage: No observations are beyond the cook’s distance and the leverage is high as compared to model-1 which is good.
I can say that the regression diagnostics of Model-2 look much better than Model-1 because of the larger cook’s distance and leverage.
Residuals vs Fitted: The Linearity assumption is met and there is a constant variance of residuals along the line. Overall the plot looks good.
Normal Q-Q: The points are falling along the theoretical quantiles, hence the normality assumption is met.
Scale-Location: Compared to model-1 & 2, the line is not horizontal.
Cook’s distance: No observation is greater than 1. The maximum cook’s distance is < 0.3 (approximately). Again as similar to model-1, the 174th observation is very influential.
Residuals vs Leverage: No observations are beyond the cook’s distance and the leverage is low.
But the influential observation 174 is not at the same place in all graphs which is odd.
The aim of this study is to find out which factors significantly effect world happiness scores and to find out a model that fits/predicts the happiness scores. The following is the summary of this study.
GDP per capita, Urbanization rate and Life expectancy are the factors that are significantly associated with Happiness scores.
North America, Europe and Australia appear to be the happiest.
South America, North Asia, Western Africa and countries/islands in the indian ocean appear to be happy on average.
South Africa, India are unhappy and of course Afghanistan is the unhappiest country in the world.
Australia has the highest median happiness score. Europe, South America, and North America are in the second place regarding median happiness score. Asia has the lowest median after Africa. We can see the range of happiness score for different continents, and also the concentration of happiness score.
The correlation between GDP per capita and happiness score in Europe, North America, and Asia is more significant than the other continents.
The relationship between happiness score and urbanization rate is very significant in Asia, Africa and Europe followed by North america and south america.
The relationship between happiness score and life expectancy is very significant in Asia, Europe and Africa followed by North america and south america.
When it comes to modeling and hypothesis testing,
Model-1 concluded that GDP per capita and Life Expectancy are strongly associated with the happiness scores.
Model-2 concluded that Life Expectancy is more strongly associated with the happiness scores, giving evidence to the third hypothesis.
Regression diagnostics of model-2 showed larger cook distance and leverage.
---
title: "Final Project"
author: "Niharika Pola"
description: "Final Project"
date: "12/13/2022"
format:
html:
df-print: paged
css: styles.css
toc: true
code-fold: true
code-copy: true
code-tools: true
categories:
- Final Project
---
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(ggplot2)
library(readr)
library(corrplot)
library(reshape2)
library(corrgram)
library(ggpubr)
library(caTools)
library(GGally)
library(stats)
library(ggsci)
```
# Happiness Explained: Through Multiple Linear Regression.
## Background/Motivation
What is takes for a country or a continent to be happy or sad? Is it the higher GDP per capita, lower Population growth, lower Suicide rate or Urbanization? What are the factors that affect a country's or continents overall happiness? Can we predict the happiness score? The curiosity to find answers to these questions made me explore the world happiness scores and world data by country.
The World Happiness Report is a landmark survey of the state of global happiness . The report continues to gain global recognition as governments, organizations and civil society increasingly use happiness indicators to inform their policy-making decisions. Leading experts across fields -- economics, psychology, survey analysis, national statistics, health, public policy and more -- describe how measurements of well-being can be used effectively to assess the progress of nations. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness.
The data sets that are used in this analysis are 1) [world happiness report 2021](https://www.kaggle.com/datasets/ajaypalsinghlo/world-happiness-report-2021) and 2) kaggle's [world data by country 2020](https://www.kaggle.com/datasets/daniboy370/world-data-by-country-2020). Happiness scores from first data set and few variables from the second data set are being used to analyse the following research question.
## Research Question
1. What are the variables or factors that are affecting a country's happiness?
2. To find out which model fits/accurately predicts the happiness score.
## Hypothesis
I wish to test the following hypothesis,
1. Higher/Better GDP per capital leads to a country's happiness.
2. Urbanization leads to a country's happiness.
3. Higher Life Expectancy leads to country's happiness.
This study aims to test the above hypothesis while comparing the GDP per capital and Urbanization with other variables including: Population growth, Life Expectancy, Sex-Ratio and Suicide Rate.
### Reading & Preparing the data sets
```{r}
#dataset-1
happiness <- read_csv("project datasets/world-happiness-report-2021.csv")
hap_sc <- select(happiness, "Country name", "Ladder score")
hap_sc <- rename(hap_sc, "Country"="Country name", "Happiness"="Ladder score")
colnames(hap_sc)
#datasets from world data by country
GDP_per_capita <- read_csv("project datasets/GDP per capita.csv")
Urbanization_rate <- read_csv("project datasets/Urbanization rate.csv")
Population_growth <- read_csv("project datasets/Population growth.csv")
Life_expectancy <- read_csv("project datasets/Life expectancy.csv")
Suicide_rate <- read_csv("project datasets/Suicide rate.csv")
Sex_ratio <- read_csv("project datasets/Sex-ratio.csv")
#merging the above datasets by country and iso code
world_data <- merge(GDP_per_capita, Urbanization_rate)
world_data <- merge(world_data, Population_growth)
world_data <- merge(world_data, Life_expectancy)
world_data <- merge(world_data, Suicide_rate)
world_data <- merge(world_data, Sex_ratio)
#merging happiness scores and world data
happy <- merge(world_data, hap_sc)
happy <- rename(happy, "GDP.per.capita"="GDP per capita", "urbanization.rate"="Urbanization rate", "Population.growth"="Population growth", "Life.expectancy"="Life expectancy", "Suicide.rate"="Suicide rate", "Sex.ratio"="Sex-ratio")
str(happy)
```
```{r}
#adding continents data to final dataset
happy$Continent <- NA
happy$Continent[which(happy$Country %in% c("Israel", "United Arab Emirates", "Singapore", "Thailand", "Taiwan Province of China",
"Qatar", "Saudi Arabia", "Kuwait", "Bahrain", "Malaysia", "Uzbekistan", "Japan",
"South Korea", "Turkmenistan", "Kazakhstan", "Turkey", "Hong Kong S.A.R., China", "Philippines",
"Jordan", "China", "Pakistan", "Indonesia", "Azerbaijan", "Lebanon", "Vietnam",
"Tajikistan", "Bhutan", "Kyrgyzstan", "Nepal", "Mongolia", "Palestinian Territories",
"Iran", "Bangladesh", "Myanmar", "Iraq", "Sri Lanka", "Armenia", "India", "Georgia",
"Cambodia", "Afghanistan", "Yemen", "Syria"))] <- "Asia"
happy$Continent[which(happy$Country %in% c("Norway", "Denmark", "Iceland", "Switzerland", "Finland",
"Netherlands", "Sweden", "Austria", "Ireland", "Germany",
"Belgium", "Luxembourg", "United Kingdom", "Czech Republic",
"Malta", "France", "Spain", "Slovakia", "Poland", "Italy",
"Russia", "Lithuania", "Latvia", "Moldova", "Romania",
"Slovenia", "North Cyprus", "Cyprus", "Estonia", "Belarus",
"Serbia", "Hungary", "Croatia", "Kosovo", "Montenegro",
"Greece", "Portugal", "Bosnia and Herzegovina", "Macedonia",
"Bulgaria", "Albania", "Ukraine"))] <- "Europe"
happy$Continent[which(happy$Country %in% c("Canada", "Costa Rica", "United States", "Mexico",
"Panama","Trinidad and Tobago", "El Salvador", "Belize", "Guatemala",
"Jamaica", "Nicaragua", "Dominican Republic", "Honduras",
"Haiti"))] <- "North America"
happy$Continent[which(happy$Country %in% c("Chile", "Brazil", "Argentina", "Uruguay",
"Colombia", "Ecuador", "Bolivia", "Peru",
"Paraguay", "Venezuela"))] <- "South America"
happy$Continent[which(happy$Country %in% c("New Zealand", "Australia"))] <- "Australia"
happy$Continent[which(is.na(happy$Continent))] <- "Africa"
# Moving the continent column's position in the dataset to the second column
happy <- happy %>% select(Country,Continent, everything())
happy <- happy[!duplicated(happy), ]
#Renaming the final dataframe to happy
view(happy)
happy <- happy[-(48:110), ]
str(happy)
```
The final data set "happy" consists of 12 variables and 136 observations. We can start the analysis now.
# Exploratory Data Analysis
## Understanding the correlation between happiness scores and other variables.
```{r}
# Create a correlation plot
ggcorr(happy, label = TRUE, label_round = 2, label_size = 3.5, size = 4, hjust = .85) +
ggtitle("Plot- 1: Correlation Heatmap") +
theme(plot.title = element_text(hjust = 0.5))
```
Observations from the correlation plot:
The happiness score is,
- Strongly correlating with GDP per capita, Urbanization rate and Life expectancy.
- Inversely correlated with suicidal rate and population growth.
- No correlation with sex ratio.
```{r}
#plotting the happiness data on world map
plot_2 <- select(happy, "Continent", "Country", "Happiness")
library(maps)
world.data <- map_data("world")
map.world_joined <- left_join(world.data, plot_2, by = c('region'='Country'))
map1 <- ggplot(map.world_joined, aes(x=long, y=lat, group=group)) +
geom_polygon(aes(fill=Happiness), color="black")
map2 <- map1 + scale_fill_gradient(name = "Happiness Score", low = "yellow", high = "red", na.value = "grey50") +
theme(axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.ticks = element_blank(),
axis.title.y = element_blank(),
axis.title.x = element_blank(),
rect = element_blank())
map2
```
North America, Europe and Australia appear to be the happiest.
South America, North Asia, Western Africa and countries/islands in the indian ocean appear to be happy on average.
South Africa, India are unhappy and of course Afghanistan is the unhappiest country in the world.
### Happiness score comparison on different continents
```{r message=FALSE, warning=FALSE, error=FALSE}
####### Happiness score for each continent
gg1 <- ggplot(happy,
aes(x=Continent,
y=Happiness,
color=Continent))+
geom_point() + theme_bw() +
theme(axis.title = element_text(family = "Helvetica", size = (8)))
gg2 <- ggplot(happy , aes(x = Continent, y = Happiness)) +
geom_boxplot(aes(fill=Continent)) + theme_bw() +
theme(axis.title = element_text(family = "Helvetica", size = (8)))
gg3 <- ggplot(happy,aes(x=Continent,y=Happiness))+
geom_violin(aes(fill=Continent),alpha=0.7)+ theme_bw() +
theme(axis.title = element_text(family = "Helvetica", size = (8)))
# Compute descriptive statistics by groups
stable <- desc_statby(happy, measure.var = "Happiness",
grps = "Continent")
stable <- stable[, c("Continent","mean","median")]
names(stable) <- c("Continent", "Mean of happiness score","Median of happiness score")
# Summary table plot
stable.p <- ggtexttable(stable,rows = NULL,
theme = ttheme("classic"))
ggarrange(gg1, gg2, ncol = 1, nrow = 2)
```
```{r message=FALSE, warning=FALSE, error=FALSE}
ggarrange(gg3, stable.p, ncol = 1, nrow = 2)
```
As we have seen before, Australia has the highest median happiness score. Europe, South America, and North America are in the second place regarding median happiness score. Asia has the lowest median after Africa. We can see the range of happiness score for different continents, and also the concentration of happiness score.
Let's see the correlation between happiness score and 3 major factors in the happiness data set that are GDP per capita, Life Expectancy, Urbanization rate to understand how they are varying.
```{r}
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = GDP.per.capita, y = Happiness)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line Happiness vs GDP per capita")
```
The correlation between GDP per capita and happiness score in Europe, North America, and Asia is more significant than the other continents. Worth mentioning that we will not take Australia into account because there are just two countries in Australia and creating scatter plot with the regression line for this continent will not give us any insight.
```{r}
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = urbanization.rate, y = Happiness)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line Happiness vs Urbanization rate")
```
The relationship between happiness score and urbanization rate is very significant in Asia, Africa and Europe followed by North america and south america.
```{r}
ggplot(subset(happy, happy$Continent != "Australia"), aes(x = Life.expectancy, y = Happiness)) +
geom_point(aes(color=Continent), size = 3, alpha = 0.8) +
geom_smooth(aes(color = Continent, fill = Continent),
method = "lm", fullrange = TRUE) +
facet_wrap(~Continent) +
theme_bw() + labs(title = "Scatter plot with regression line Happiness vs Life Expectancy")
```
The relationship between happiness score and life expectancy is very significant in Asia, Europe and Africa followed by North america and south america.
## Modeling & Hypothesis testing
From the above visualizations and analysis, we have found that GDP per capita, Urbanization and Life Expectancy play major role in the happiness score. Whereas suicide rate and population growth has an inverse relationship. Let's build a model for estimating happiness scores based on these control variables and see which of these has effect on the overall happiness scores.
### Model-1
```{r}
#Model-1
model1 <- lm(Happiness ~ GDP.per.capita + urbanization.rate + Life.expectancy, data = happy)
summary(model1)
```
It can be seen that p-value of the F-statistic is \< 2.2e-16, which is highly significant. This means that, at least, one of the predictor/control variables ( GDP per capita, Urbanization rare and Life Expectancy) is significantly related to the outcome variable (Happiness score).
It can be seen that, changing in GDP per capita and Life expectancy are significantly associated to changes in happiness scores while changes in urbanization rate is not significantly associated with sales.
```{r}
summary(model1)$coefficient
```
```{r}
confint(model1)
```
Lets see how this model behaves if we apply log on the variables
```{r}
#Model-1 with log
model1_log <- lm(Happiness ~ log(GDP.per.capita + urbanization.rate + Life.expectancy), data = happy)
summary(model1_log)
```
The R-Squared value drops from 0.6693 to 0.6134 by applying log for model-1.
### Model-2
```{r}
#Model-2
model2 <- lm(Happiness ~ GDP.per.capita + Life.expectancy, data = happy)
summary(model2)
```
```{r}
summary(model2)$coefficient
```
It turns out Life Expectancy is strongly associated with happiness scores and GDP per capita follows life expectancy.
```{r}
confint(model2)
```
However the R-squared values of Model-1 and Model-2 are approximately around 65-66%. Hence we cannot say which model is best fit to predict the happiness scores.
Lets consider all the 5 control variables and see the results.
Lets see how model-2 behaves if we apply log on the variables
```{r}
#Model-2 with log
model2_log <- lm(Happiness ~ log(GDP.per.capita + Life.expectancy), data = happy)
summary(model2_log)
```
The R-Squared value drops from 0.66 to 0.61 by applying log for model-2.
### Model-3
```{r}
#Model-3
model3 <- lm(Happiness ~ GDP.per.capita + urbanization.rate + Life.expectancy + Suicide.rate + Population.growth , data = happy)
summary(model3)
```
The R-squared remains same after adding additional two control variables. So far, the second model fits the happiness scores.
Lets see how model-3 behaves if we apply log on the control variables
```{r}
#Model-3 with log
model3_log <- lm(Happiness ~ log(GDP.per.capita + urbanization.rate + Life.expectancy + Suicide.rate + Population.growth) , data = happy)
summary(model3_log)
```
The R-squared values drops from 0.66 to 0.61 by application of log.
## Model comprision
+-----------------------------------------------------------------------------------------------------+---------------------+--------------------+-----------------------------------+
| Models | Multiple R-squared | Adjusted R-squared | Significant Variables |
+=====================================================================================================+=====================+====================+===================================+
| Model-1 (GDP per capita+Urbanization rate+Life expectancy) | 0.6693 | 0.6618 | GDP per capita & Life Expectancy |
+-----------------------------------------------------------------------------------------------------+---------------------+--------------------+-----------------------------------+
| Model-1 (Log(GDP per capita+Urbanization rate+Life expectancy)) | 0.6134 | 0.6105 | \- |
+-----------------------------------------------------------------------------------------------------+---------------------+--------------------+-----------------------------------+
| Model-2 (GDP per capita+Life expectancy) | 0.6601 | 0.655 | GDP per capita & Life Expectancy. |
| | | | |
| | | | Life expectancy \> GDP per capita |
+-----------------------------------------------------------------------------------------------------+---------------------+--------------------+-----------------------------------+
| Model-2 (Log(GDP per capita+Life expectancy)) | 0.6129 | 0.61 | \- |
+-----------------------------------------------------------------------------------------------------+---------------------+--------------------+-----------------------------------+
| Model-3 (GDP per capita+Urbanization rate +Life expectancy + Suicide rate + population growth) | 0.6601 | 0.655 | GDP per capita & Life Expectancy |
+-----------------------------------------------------------------------------------------------------+---------------------+--------------------+-----------------------------------+
| Model-3 (Log(GDP per capita+Urbanization rate +Life expectancy + Suicide rate + population growth)) | 0.6136 | 0.6107 | GDP per capita & Life Expectancy. |
| | | | |
| | | | Life expectancy \> GDP per capita |
+-----------------------------------------------------------------------------------------------------+---------------------+--------------------+-----------------------------------+
- It can be derived from the above table that the adjusted R-Squared for Model-2 is better than other two models, hence it can be said that Model-2 is a good fit.
- Application of Logarithm on the control variables reduced the R-squared values in all the 3 models.
- **Hypothesis testing conclusion:**
Turns out we have evidence for all the 3 hypothesis but,
- We have a strong evidence for hypothesis-3 i.e. Higher Life Expectancy leads to country's happiness.
\-\-\-\-\--
As we have concluded that Model-2 is a goof fit, Lets divide the happy dataset into training and test and plot the predicted happiness scores for Model-2.
```{r}
# Splitting the dataset into the Training set and Test set
set.seed(123)
dataset <- happy[4:10]
split = sample.split(dataset$Happiness, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
```
```{r}
# Fitting Multiple Linear Regression to the Training set on Model-2
regressor_lm = lm(formula = Happiness ~ GDP.per.capita + Life.expectancy,
data = training_set)
summary(regressor_lm)
```
```{r message=FALSE, warning=FALSE, error=FALSE}
y_pred_lm = predict(regressor_lm, newdata = test_set)
Pred_Actual_lm <- as.data.frame(cbind(Prediction = y_pred_lm, Actual = test_set$Happiness))
gg.lm <- ggplot(Pred_Actual_lm, aes(Actual, Prediction )) +
geom_point() + theme_bw() + geom_abline() +
labs(title = "Multiple Linear Regression", x = "Actual happiness score",
y = "Predicted happiness score") +
theme(plot.title = element_text(family = "Helvetica", face = "bold", size = (15)),
axis.title = element_text(family = "Helvetica", size = (10)))
gg.lm
```
**The Model shows 67% accuracy on the test data.**
## Regression Diagnostics
### Model-1
```{r}
#Model-1
par(mfrow = c(2,3)); plot(model1, which = 1:6)
```
**Model-1 Diagnostics:**
Residuals vs Fitted: The Linearity assumption is met and there is a constant variance of residuals along the line. Overall the plot looks good.
Normal Q-Q: The points are falling along the theoretical quantiles, hence the normality assumption is met.
Scale-Location: The line is approximately horizontal, so I can say that the constant variance assumption is met.
Cook's distance: No observation is greater than 1. The maximum cook's distance is \<0.3. 174th observation is the influential.
Residuals vs Leverage: No observations are beyond the cook's distance and the leverage is low.
But the influential observation 174 is not at the same place in all graphs which is odd.
### Model-2
```{r}
#Model-2
par(mfrow = c(2,3)); plot(model2, which = 1:6)
```
**Model-2 Diagnostics:**
Residuals vs Fitted: I would say the linearity assumption is violated here but the constant variance of residuals is present.
Normal Q-Q: The points are falling along the theoretical quantiles, hence the normality assumption is met.
Scale-Location: Compared to model-1, the line is almost horizontal. This one looks good.
Cook's distance: No observation is greater than 1. The maximum cook's distance is 0.9 (approximately). The 156th observation is very influential.
Residuals vs Leverage: No observations are beyond the cook's distance and the leverage is high as compared to model-1 which is good.
**I can say that the regression diagnostics of Model-2 look much better than Model-1 because of the larger cook's distance and leverage.**
### Model-3
```{r}
#Model-3
par(mfrow = c(2,3)); plot(model3, which = 1:6)
```
Residuals vs Fitted: The Linearity assumption is met and there is a constant variance of residuals along the line. Overall the plot looks good.
Normal Q-Q: The points are falling along the theoretical quantiles, hence the normality assumption is met.
Scale-Location: Compared to model-1 & 2, the line is not horizontal.
Cook's distance: No observation is greater than 1. The maximum cook's distance is \< 0.3 (approximately). Again as similar to model-1, the 174th observation is very influential.
Residuals vs Leverage: No observations are beyond the cook's distance and the leverage is low.
But the influential observation 174 is not at the same place in all graphs which is odd.
## Conclusion
The aim of this study is to find out which factors significantly effect world happiness scores and to find out a model that fits/predicts the happiness scores. The following is the summary of this study.
- GDP per capita, Urbanization rate and Life expectancy are the factors that are significantly associated with Happiness scores.
- North America, Europe and Australia appear to be the happiest.
- South America, North Asia, Western Africa and countries/islands in the indian ocean appear to be happy on average.
- South Africa, India are unhappy and of course Afghanistan is the unhappiest country in the world.
- Australia has the highest median happiness score. Europe, South America, and North America are in the second place regarding median happiness score. Asia has the lowest median after Africa. We can see the range of happiness score for different continents, and also the concentration of happiness score.
- The correlation between GDP per capita and happiness score in Europe, North America, and Asia is more significant than the other continents.
- The relationship between happiness score and urbanization rate is very significant in Asia, Africa and Europe followed by North america and south america.
- The relationship between happiness score and life expectancy is very significant in Asia, Europe and Africa followed by North america and south america.
When it comes to modeling and hypothesis testing,
- Model-1 concluded that GDP per capita and Life Expectancy are strongly associated with the happiness scores.
- Model-2 concluded that Life Expectancy is more strongly associated with the happiness scores, giving evidence to the third hypothesis.
- Regression diagnostics of model-2 showed larger cook distance and leverage.
## References
1. <https://www.kaggle.com/code/noobiedatascientist/explaining-happiness/data>
2. <https://worldhappiness.report/ed/2021/>
3. <http://www.sthda.com/english/articles/40-regression-analysis/168-multiple-linear-regression-in-r/#building-model>
4. [https://www.guru99.com/r-simple-multiple-linear-regression.html](https://www.guru99.com/r-simple-multiple-linear-regression.htmlhttps://www.guru99.com/r-simple-multiple-linear-regression.html)